Francesco Mora - 100439601
Jose Maria Martínez Marín - 100443343
Nowadays, electricity networks of advanced countries rely more and more in non-operable renewable energy sources, mainly wind and solar. However, in order to integrate energy sources in the electricity network, it is required that the amount of energy to be generated to be forecasted 24 hours in advance, so that energy plants connected to the electricity network can be planned and prepared to meet supply and demand during the next day (For more details, check “Electricity Market” at Wikipedia).
This is not an issue for traditional energy sources (gas, oil, hydropower, …) because they can be generated at will (by burning more gas, for example). But solar and wind energies are not under the control of the energy operator (i.e. they are non-operable), because they depend on the weather. Therefore, they must be forecasted with high accuracy. This can be achieved to some extent by accurate weather forecasts. The Global Forecast System (GFS, USA) and the European Centre for Medium-Range Weather Forecasts (ECMWF) are two of the most important Numerical Weather Prediction models (NWP) for this purpose.
Yet, although NWP’s are very good at predicting accurately variables like “100-meter U wind component”, related to wind speed, the relation between those variables and the electricity actually produced is not straightforward. Machine Learning models can be used for this task.
In particular, we are going to use meteorological variables forecasted by ECMWF (http://www.ecmwf.int/) as input attributes to a machine learning model that is able to estimate how much energy is going to be produced at the Sotavento experimental wind farm (http://www.sotaventogalicia.com/en).
More concretely, we intend to train a machine learning model f, so that:
• Given the 00:00am ECMWF forecast for variables A6:00, B6:00, C6:00, … at 6:00 am (i.e. six hours in advance)
• f(A6:00, B6:00, C6:00, …) = electricity generated at Sotavento at 6:00
We will assume that we are not experts on wind energy generation (not too far away from the truth, actually). This means we are not sure which meteorological variables are the most relevant, so we will use many of them, and let the machine learning models and attribute selection algorithms select the relevant ones. Specifically, 22 variables will be used. Some of them are clearly related to wind energy production (like “100 metre U wind component”), others not so clearly (“Leaf area index, high vegetation”). Also, it is common practice to use the value of those variables, not just at the location of interest (Sotavento in this case), but at points in a grid around Sotavento. A 5x5 grid will be used in this case.
Therefore, each meteorological variable has been instantiated at 25 different locations (location 13 is actually Sotavento). That is why, for instance, attribute iews appears 25 times in the dataset (iews.1, iews.2, …, iews.13, …, iews.25). Therefore, the dataset contains 22*25 = 550 input attributes.
Introduce NA's, fit the NA's, divide the dataset, scale the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import graphviz
import scipy.stats as stats
import optuna
import plotly
import math
from sklearn import tree, neighbors
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import PredefinedSplit
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import KNNImputer
from skopt import BayesSearchCV
from skopt.space import Integer, Real, Categorical
from scipy.stats import uniform, expon
from scipy.stats import randint as sp_randint
import optuna.visualization
import pickle
import os
from numpy.random import randint
from sklearn.datasets import make_regression
from sklearn import ensemble
import xgboost as xgb
from xgboost.sklearn import XGBRegressor
import lightgbm
from lightgbm.sklearn import LGBMRegressor
import catboost
from catboost import CatBoostRegressor
from sklearn.dummy import DummyRegressor
from sklearn.feature_selection import SelectKBest, f_regression, chi2
from sklearn.pipeline import Pipeline
data = pd.read_pickle('wind_pickle.pickle')
#introduce some random NA
my_NIA = 100439601
np.random.seed(my_NIA)
how_many_nas = round(data.shape[0]*data.shape[1]*0.05)
print('Lets put '+str(how_many_nas)+' missing values \n')
x_locations = randint(0, data.shape[0],size=how_many_nas)
y_locations = randint(6, data.shape[1]-1,size=how_many_nas)
for i in range(len(x_locations)):
data.iat[x_locations[i], y_locations[i]] = np.nan
data.to_pickle('wind_pickle_with_nan.pickle')
data = pd.read_pickle('wind_pickle_with_nan.pickle')
data=data.values
#deal with the NA
imputer = KNNImputer(n_neighbors=2, weights="uniform")
data = imputer.fit_transform(data)
#divide into X and y, and scale the data
X=np.delete(data, np.s_[0], axis=1)
scaler = preprocessing.StandardScaler().fit(X)
X = scaler.transform(X)
y = data[:,0]
#divide the dataset into training, validation and test sets
X_train = X[0:2528,:]
y_train = y[0:2528]
X_val = X[2529:3827,:]
y_val = y[2529:3827]
X_test = X[3828:,:]
y_test = y[3828:]
X_train = X_train[:,5:]
X_val = X_val[:,5:]
X_test = X_test[:,5:]
split_index = [-1]*3828
for i in range(2529,3828):
split_index[i]=0
tr_val_partition = PredefinedSplit(test_fold=split_index)
Lets put 165049 missing values
Train KNN, Random Forest, and Gradient Boosting models with and without hyper-parameter tuning.
Both BayesSearch and Optuna can be used (but Optuna will be graded better). If you use advanced implementations of GradientBoosting such as XGboost or LightBoost, your work will get higher grades.
Compare all of them using the test partition, so that we can conclude what is the best method. Also, compare results with those of trivial (dummy) models.
1a) Train a KNN model with default hyper-parameters
As a performance measure, the MAE is used. With MAE, instances with large errors don't have so much weight on the average.
#KneighborsRegressor without parameter tuning
np.random.seed(0)
neigh = KNeighborsRegressor()
start = time.time()
neigh = neigh.fit(X_train,y_train)
end = time.time()
totaltime_knn = end-start
y_train_pred = neigh.predict(X_train)
y_test_pred = neigh.predict(X_test)
error_knn = metrics.mean_absolute_error(y_test,y_test_pred)
final_model_tree = neigh.fit(X,y)
print('The error (MAE) of the KNeighbors Regressor without tuning (K=',list(neigh.get_params().values())[5],') is: ')
print('test set: ', error_knn)
The error (MAE) of the KNeighbors Regressor without tuning (K= 5 ) is: test set: 335.23580085348505
1b) Train a KNN model with hyper-parameter tuning The hyper-parameters that have been considered are:
The budget for the inner evaluation is set to 10.
The best parameters are then used for the outer evaluation.
#KneighborsRegressor with hyper-parameter tuning
#inner evaluation
def objective(trial):
n_neighbors = trial.suggest_int("n_neighbors",2,60,step=4)
weights = trial.suggest_categorical('weights', ['uniform','distance'])
p = trial.suggest_categorical('p', [1,2])
clf = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors, weights=weights, p=p)
clf.fit(X_train, y_train)
y_val_pred = clf.predict(X_val)
inner_mae = metrics.mean_absolute_error(y_val, y_val_pred)
return inner_mae
budget=10
np.random.seed(0)
knnoptuna = optuna.create_study(direction="minimize")
start = time.time()
knnoptuna.optimize(objective,n_trials=budget)
print('The parameters chosen by inner evaluation are: ')
print(knnoptuna.best_params)
#outer evaluation
clf = neighbors.KNeighborsRegressor(**knnoptuna.best_params)
clf.fit(X_train, y_train)
y_test_pred = clf.predict(X_test)
end = time.time()
totaltime_knnoptuna = end-start
error_knn_optuna = metrics.mean_absolute_error(y_test, y_test_pred)
print('This parameters are taken to train the model.')
print('\nThe model is then tested and the outer MAE is:')
print(error_knn_optuna)
[I 2021-01-11 17:23:07,446] A new study created in memory with name: no-name-f8eabb43-33c2-4656-9918-5d6b9e520ecf C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. warnings.warn( [I 2021-01-11 17:23:12,261] Trial 0 finished with value: 323.4388123428757 and parameters: {'n_neighbors': 38, 'weights': 'uniform', 'p': 2}. Best is trial 0 with value: 323.4388123428757. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. warnings.warn( [I 2021-01-11 17:23:15,888] Trial 1 finished with value: 315.6724812516385 and parameters: {'n_neighbors': 38, 'weights': 'distance', 'p': 1}. Best is trial 1 with value: 315.6724812516385. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. warnings.warn( [I 2021-01-11 17:23:19,236] Trial 2 finished with value: 356.3225192604006 and parameters: {'n_neighbors': 2, 'weights': 'uniform', 'p': 1}. Best is trial 1 with value: 315.6724812516385. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. warnings.warn( [I 2021-01-11 17:23:23,692] Trial 3 finished with value: 326.85144353135877 and parameters: {'n_neighbors': 54, 'weights': 'uniform', 'p': 2}. Best is trial 1 with value: 315.6724812516385. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. warnings.warn( [I 2021-01-11 17:23:27,201] Trial 4 finished with value: 317.33360088395096 and parameters: {'n_neighbors': 38, 'weights': 'uniform', 'p': 1}. Best is trial 1 with value: 315.6724812516385. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. warnings.warn( [I 2021-01-11 17:23:30,510] Trial 5 finished with value: 356.126818311496 and parameters: {'n_neighbors': 2, 'weights': 'distance', 'p': 1}. Best is trial 1 with value: 315.6724812516385. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. warnings.warn( [I 2021-01-11 17:23:35,053] Trial 6 finished with value: 320.6640429331839 and parameters: {'n_neighbors': 22, 'weights': 'uniform', 'p': 2}. Best is trial 1 with value: 315.6724812516385. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. warnings.warn( [I 2021-01-11 17:23:39,259] Trial 7 finished with value: 319.6371035210866 and parameters: {'n_neighbors': 54, 'weights': 'uniform', 'p': 1}. Best is trial 1 with value: 315.6724812516385. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. warnings.warn( [I 2021-01-11 17:23:43,399] Trial 8 finished with value: 313.3424521532736 and parameters: {'n_neighbors': 26, 'weights': 'distance', 'p': 1}. Best is trial 8 with value: 313.3424521532736. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. warnings.warn( [I 2021-01-11 17:23:48,111] Trial 9 finished with value: 323.43164347865496 and parameters: {'n_neighbors': 34, 'weights': 'uniform', 'p': 2}. Best is trial 8 with value: 313.3424521532736.
The parameters chosen by inner evaluation are:
{'n_neighbors': 26, 'weights': 'distance', 'p': 1}
This parameters are taken to train the model.
The model is then tested and the outer MAE is:
325.7151434853307
optuna.visualization.plot_contour(knnoptuna)
The graphs above highlight the contour plots of the different combinations of parameters.
The graph of n_neighbors vs weights shows that the function has one area wich is lighter (for n_neighbors bigger than 20).
optuna.visualization.plot_parallel_coordinate(knnoptuna)
There is not a clear trend from this graph, because the function has many local optimal points.
optuna.visualization.plot_param_importances(knnoptuna)
n_neighbors is by far the most significative parameter, followed by p.
optuna.visualization.plot_slice(knnoptuna)
In general, it seems that p=1 is better than p=2 but there's an extreme point, and about weights distance it's not very clear.
The most interesting values for n_neighbors are the ones between 20 and 40.
optuna.visualization.plot_optimization_history(knnoptuna)
1c) Train a Random Forest model with default hyper-parameters
#RandomForestRegressor without parameter tuning
np.random.seed(0)
rf = RandomForestRegressor()
start = time.time()
rf = rf.fit(X_train,y_train)
end = time.time()
totaltime_randfor_notun = end-start
y_test_pred = rf.predict(X_test)
error_randfor_notun = metrics.mean_absolute_error(y_test,y_test_pred)
print('The error (MAE) of the Random Forest Regressor without tuning is: ')
print('test set: ', error_randfor_notun)
The error (MAE) of the Random Forest Regressor without tuning is: test set: 289.4331080132764
1d) Train a Random Forest model with hyper-parameter tuning The hyper-parameters that have been considered are:
The budget for the inner evaluation is set to 15.
The best parameters are then used for the outer evaluation.
#RandomForestRegressor with parameter tuning
#inner evaluation
def objective(trial):
max_depth = trial.suggest_int("max_depth",2,40, step=4)
min_samples_split = trial.suggest_int("min_samples_split",2,40, step=4)
min_samples_leaf = trial.suggest_int("min_samples_leaf",2,20, step=2)
n_estimators = trial.suggest_int("n_estimators",2,60, step=4)
clf = ensemble.RandomForestRegressor(max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf, n_estimators=n_estimators)
clf.fit(X_train, y_train)
y_val_pred = clf.predict(X_val)
inner_mae = metrics.mean_absolute_error(y_val, y_val_pred)
return inner_mae
budget=15
np.random.seed(0)
randforoptuna = optuna.create_study(direction="minimize")
start = time.time()
randforoptuna.optimize(objective,n_trials=budget)
print('The parameters chosen by inner evaluation are: ')
print(randforoptuna.best_params)
#outer evaluation
clf = ensemble.RandomForestRegressor(**randforoptuna.best_params)
clf.fit(X_train, y_train)
y_test_pred = clf.predict(X_test)
end = time.time()
totaltime_randforoptuna = end-start
error_randfor_optuna = metrics.mean_absolute_error(y_test, y_test_pred)
print('This parameters are taken to train the model.')
print('\nThe model is then tested and the outer MAE is:')
print(error_randfor_optuna)
[I 2021-01-11 17:27:27,172] A new study created in memory with name: no-name-4bf36460-0fcf-43b1-bcdd-2cf6e7f274fd C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 40] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 38]. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. [I 2021-01-11 17:27:34,207] Trial 0 finished with value: 285.8216629504205 and parameters: {'max_depth': 10, 'min_samples_split': 26, 'min_samples_leaf': 16, 'n_estimators': 10}. Best is trial 0 with value: 285.8216629504205. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 40] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 38]. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. [I 2021-01-11 17:27:45,347] Trial 1 finished with value: 284.2027828615766 and parameters: {'max_depth': 38, 'min_samples_split': 26, 'min_samples_leaf': 4, 'n_estimators': 10}. Best is trial 1 with value: 284.2027828615766. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 40] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 38]. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. [I 2021-01-11 17:28:21,691] Trial 2 finished with value: 276.6380728024183 and parameters: {'max_depth': 30, 'min_samples_split': 30, 'min_samples_leaf': 4, 'n_estimators': 38}. Best is trial 2 with value: 276.6380728024183. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 40] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 38]. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. [I 2021-01-11 17:28:40,660] Trial 3 finished with value: 283.65617322484405 and parameters: {'max_depth': 38, 'min_samples_split': 30, 'min_samples_leaf': 2, 'n_estimators': 18}. Best is trial 2 with value: 276.6380728024183. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 40] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 38]. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. [I 2021-01-11 17:29:16,581] Trial 4 finished with value: 279.4073974741218 and parameters: {'max_depth': 30, 'min_samples_split': 38, 'min_samples_leaf': 18, 'n_estimators': 46}. Best is trial 2 with value: 276.6380728024183. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 40] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 38]. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. [I 2021-01-11 17:29:39,444] Trial 5 finished with value: 288.47825128496413 and parameters: {'max_depth': 6, 'min_samples_split': 2, 'min_samples_leaf': 18, 'n_estimators': 42}. Best is trial 2 with value: 276.6380728024183. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 40] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 38]. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. [I 2021-01-11 17:30:07,640] Trial 6 finished with value: 279.80642178963393 and parameters: {'max_depth': 26, 'min_samples_split': 2, 'min_samples_leaf': 18, 'n_estimators': 34}. Best is trial 2 with value: 276.6380728024183. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 40] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 38]. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. [I 2021-01-11 17:30:18,459] Trial 7 finished with value: 293.4622486260847 and parameters: {'max_depth': 6, 'min_samples_split': 22, 'min_samples_leaf': 2, 'n_estimators': 22}. Best is trial 2 with value: 276.6380728024183. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 40] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 38]. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. [I 2021-01-11 17:30:25,538] Trial 8 finished with value: 282.7389500379642 and parameters: {'max_depth': 10, 'min_samples_split': 10, 'min_samples_leaf': 6, 'n_estimators': 10}. Best is trial 2 with value: 276.6380728024183. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 40] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 38]. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. [I 2021-01-11 17:31:08,514] Trial 9 finished with value: 276.7801819609312 and parameters: {'max_depth': 38, 'min_samples_split': 10, 'min_samples_leaf': 8, 'n_estimators': 50}. Best is trial 2 with value: 276.6380728024183. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 40] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 38]. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. [I 2021-01-11 17:31:31,366] Trial 10 finished with value: 279.7984038004112 and parameters: {'max_depth': 18, 'min_samples_split': 38, 'min_samples_leaf': 12, 'n_estimators': 30}. Best is trial 2 with value: 276.6380728024183. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 40] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 38]. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. [I 2021-01-11 17:32:27,745] Trial 11 finished with value: 275.67539365649856 and parameters: {'max_depth': 30, 'min_samples_split': 14, 'min_samples_leaf': 8, 'n_estimators': 58}. Best is trial 11 with value: 275.67539365649856. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 40] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 38]. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. [I 2021-01-11 17:33:19,412] Trial 12 finished with value: 275.0101654152411 and parameters: {'max_depth': 30, 'min_samples_split': 14, 'min_samples_leaf': 10, 'n_estimators': 58}. Best is trial 12 with value: 275.0101654152411. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 40] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 38]. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. [I 2021-01-11 17:34:23,631] Trial 13 finished with value: 275.25998616912625 and parameters: {'max_depth': 22, 'min_samples_split': 14, 'min_samples_leaf': 12, 'n_estimators': 58}. Best is trial 12 with value: 275.0101654152411. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 40] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 38]. C:\Users\cecco\anaconda3\lib\site-packages\optuna\distributions.py:566: UserWarning: The distribution is specified by [2, 60] and step=4, but the range is not divisible by `step`. It will be replaced by [2, 58]. [I 2021-01-11 17:35:15,112] Trial 14 finished with value: 277.49903143999074 and parameters: {'max_depth': 22, 'min_samples_split': 14, 'min_samples_leaf': 12, 'n_estimators': 58}. Best is trial 12 with value: 275.0101654152411.
The parameters chosen by inner evaluation are:
{'max_depth': 30, 'min_samples_split': 14, 'min_samples_leaf': 10, 'n_estimators': 58}
This parameters are taken to train the model.
The model is then tested and the outer MAE is:
288.05325233188057
optuna.visualization.plot_contour(randforoptuna)
The contour plots show clearly the shape of the objective function for each combination.
The lighter portions of the graphs are the ones with lowest objective function value.
optuna.visualization.plot_parallel_coordinate(randforoptuna)
optuna.visualization.plot_param_importances(randforoptuna)
The parameters n_estimators and max_depth are by far the most significant among the 4.
optuna.visualization.plot_slice(randforoptuna)
The objective function has lower values for max_depth close to 30 and min_sample_leaf close to 10. In the n_estimators graph it seems that the function decreases as n_estimators increases.
optuna.visualization.plot_optimization_history(randforoptuna)
1e) Train a Gradient Boosting model with default hyper-parameters
#XGBoost without parameter tuning
np.random.seed(0)
gf = XGBRegressor()
start = time.time()
gf = gf.fit(X_train,y_train)
end = time.time()
totaltime_xgb_notun = end-start
y_test_pred = gf.predict(X_test)
error_xgb_notun = metrics.mean_absolute_error(y_test,y_test_pred)
print('The errors (MAE) of the Gradient Boosting Regressor without tuning is: ')
print('test set: ', error_xgb_notun)
The errors (MAE) of the Gradient Boosting Regressor without tuning is: test set: 303.0110758406633
1f) Train a Gradient Boosting model with hyper-parameter tuning The hyper-parameters that have been considered are:
The budget for the inner evaluation is set to 20.
The best parameters are then used for the outer evaluation.
#XGBoost with parameter tuning
#inner evaluation
def objective(trial):
max_depth = trial.suggest_int("max_depth",2,40)
n_estimators = trial.suggest_int("n_estimators",2,20)
learning_rate = trial.suggest_uniform("learning_rate",0.05,0.4)
clf = xgb.XGBRegressor(max_depth=max_depth, n_estimators=n_estimators, learning_rate=learning_rate)
clf.fit(X_train, y_train)
y_val_pred = clf.predict(X_val)
inner_mae = metrics.mean_absolute_error(y_val, y_val_pred)
return inner_mae
budget=20
np.random.seed(0)
xgboostoptuna = optuna.create_study(direction="minimize")
start = time.time()
xgboostoptuna.optimize(objective,n_trials=budget)
print('The parameters chosen by inner evaluation are: ')
print(xgboostoptuna.best_params)
#outer evaluation
clf = xgb.XGBRegressor(**xgboostoptuna.best_params)
clf.fit(X_train, y_train)
y_test_pred = clf.predict(X_test)
end = time.time()
totaltime_xgboostoptuna = end-start
error_xgboost_optuna = metrics.mean_absolute_error(y_test, y_test_pred)
print('This parameters are taken to train the model.')
print('\nThe model is then tested and the outer MSE is:')
print(error_xgboost_optuna)
[I 2021-01-11 15:40:31,401] A new study created in memory with name: no-name-e2ba3ca2-913a-4d03-957d-1c68183b3cdf [I 2021-01-11 15:40:40,647] Trial 0 finished with value: 300.1082343834758 and parameters: {'max_depth': 15, 'n_estimators': 17, 'learning_rate': 0.13627653427051556}. Best is trial 0 with value: 300.1082343834758. [I 2021-01-11 15:40:45,716] Trial 1 finished with value: 323.2691631886478 and parameters: {'max_depth': 40, 'n_estimators': 7, 'learning_rate': 0.19838991673112638}. Best is trial 0 with value: 300.1082343834758. [I 2021-01-11 15:40:57,669] Trial 2 finished with value: 306.80404172912034 and parameters: {'max_depth': 21, 'n_estimators': 20, 'learning_rate': 0.2077319343453673}. Best is trial 0 with value: 300.1082343834758. [I 2021-01-11 15:41:14,478] Trial 3 finished with value: 302.92594987399036 and parameters: {'max_depth': 37, 'n_estimators': 17, 'learning_rate': 0.16647897185417548}. Best is trial 0 with value: 300.1082343834758. [I 2021-01-11 15:41:19,247] Trial 4 finished with value: 328.94695587986973 and parameters: {'max_depth': 28, 'n_estimators': 5, 'learning_rate': 0.2526262716010467}. Best is trial 0 with value: 300.1082343834758. [I 2021-01-11 15:41:27,218] Trial 5 finished with value: 303.52774781831056 and parameters: {'max_depth': 15, 'n_estimators': 12, 'learning_rate': 0.16543916934907638}. Best is trial 0 with value: 300.1082343834758. [I 2021-01-11 15:41:32,382] Trial 6 finished with value: 304.6272053625404 and parameters: {'max_depth': 10, 'n_estimators': 12, 'learning_rate': 0.1402738259877592}. Best is trial 0 with value: 300.1082343834758. [I 2021-01-11 15:41:41,485] Trial 7 finished with value: 303.9736358987752 and parameters: {'max_depth': 16, 'n_estimators': 15, 'learning_rate': 0.23912942058002107}. Best is trial 0 with value: 300.1082343834758. [I 2021-01-11 15:41:51,888] Trial 8 finished with value: 314.7126087699502 and parameters: {'max_depth': 34, 'n_estimators': 12, 'learning_rate': 0.3773130569964846}. Best is trial 0 with value: 300.1082343834758. [I 2021-01-11 15:41:56,712] Trial 9 finished with value: 312.77749372399273 and parameters: {'max_depth': 36, 'n_estimators': 7, 'learning_rate': 0.3612028463482027}. Best is trial 0 with value: 300.1082343834758. [I 2021-01-11 15:41:59,396] Trial 10 finished with value: 351.53460245940647 and parameters: {'max_depth': 3, 'n_estimators': 20, 'learning_rate': 0.06062956369110012}. Best is trial 0 with value: 300.1082343834758. [I 2021-01-11 15:42:11,206] Trial 11 finished with value: 334.840027371746 and parameters: {'max_depth': 25, 'n_estimators': 16, 'learning_rate': 0.08046125527839132}. Best is trial 0 with value: 300.1082343834758. [I 2021-01-11 15:42:14,788] Trial 12 finished with value: 290.5869567111681 and parameters: {'max_depth': 5, 'n_estimators': 17, 'learning_rate': 0.11688858955179085}. Best is trial 12 with value: 290.5869567111681. [I 2021-01-11 15:42:16,361] Trial 13 finished with value: 338.2138450135876 and parameters: {'max_depth': 2, 'n_estimators': 18, 'learning_rate': 0.0971436723818332}. Best is trial 12 with value: 290.5869567111681. [I 2021-01-11 15:42:21,167] Trial 14 finished with value: 302.274413620724 and parameters: {'max_depth': 8, 'n_estimators': 14, 'learning_rate': 0.11753681179498227}. Best is trial 12 with value: 290.5869567111681. [I 2021-01-11 15:42:28,693] Trial 15 finished with value: 296.0767363693975 and parameters: {'max_depth': 9, 'n_estimators': 19, 'learning_rate': 0.294181603960341}. Best is trial 12 with value: 290.5869567111681. [I 2021-01-11 15:42:33,901] Trial 16 finished with value: 296.2785752008729 and parameters: {'max_depth': 6, 'n_estimators': 20, 'learning_rate': 0.3025013350402427}. Best is trial 12 with value: 290.5869567111681. [I 2021-01-11 15:42:42,219] Trial 17 finished with value: 298.3437167432018 and parameters: {'max_depth': 11, 'n_estimators': 19, 'learning_rate': 0.29527423702748173}. Best is trial 12 with value: 290.5869567111681. [I 2021-01-11 15:42:43,636] Trial 18 finished with value: 304.04981391800936 and parameters: {'max_depth': 2, 'n_estimators': 15, 'learning_rate': 0.3142393667993598}. Best is trial 12 with value: 290.5869567111681. [I 2021-01-11 15:42:46,053] Trial 19 finished with value: 291.78698956341145 and parameters: {'max_depth': 6, 'n_estimators': 9, 'learning_rate': 0.2775906329138362}. Best is trial 12 with value: 290.5869567111681.
The parameters chosen by inner evaluation are:
{'max_depth': 5, 'n_estimators': 17, 'learning_rate': 0.11688858955179085}
This parameters are taken to train the model.
The model is then tested and the outer MSE is:
307.0532633553257
optuna.visualization.plot_contour(xgboostoptuna)
In the most interesting parts of the contour plots, the surface corresponding to the lower values is quite small.
optuna.visualization.plot_parallel_coordinate(xgboostoptuna)
optuna.visualization.plot_param_importances(xgboostoptuna)
n_estimators and max_depth are more or less equally important, while learning_rate is more important.
optuna.visualization.plot_slice(xgboostoptuna)
The situations described by the three scatterplots are not so clear. Anyway it seems that the most interesting areas are:
optuna.visualization.plot_optimization_history(xgboostoptuna)
1g) Train a Light Gradient Boosting model with hyper-parameter tuning
#lightGBM with parameter tuning
#inner evaluation
def objective(trial):
max_depth = trial.suggest_int("max_depth",2,40)
n_estimators = trial.suggest_int("n_estimators",2,80)
learning_rate = trial.suggest_uniform("learning_rate",0.05,0.4)
num_leaves = trial.suggest_int("num_leaves",2,10)
clf = lightgbm.LGBMRegressor(max_depth=max_depth, n_estimators=n_estimators, learning_rate=learning_rate, num_leaves=num_leaves)
clf.fit(X_train, y_train)
y_val_pred = clf.predict(X_val)
inner_mae = metrics.mean_absolute_error(y_val, y_val_pred)
return inner_mae
budget=30
np.random.seed(0)
lgbmoptuna = optuna.create_study(direction="minimize")
start = time.time()
lgbmoptuna.optimize(objective,n_trials=budget)
print('The parameters chosen by inner evaluation are: ')
print(lgbmoptuna.best_params)
#outer evaluation
clf = lightgbm.LGBMRegressor(**xgboostoptuna.best_params)
clf.fit(X_train, y_train)
y_test_pred = clf.predict(X_test)
end = time.time()
totaltime_lgbmoptuna = end-start
error_lgbm_optuna = metrics.mean_absolute_error(y_test, y_test_pred)
print('This parameters are taken to train the model.')
print('\nThe model is then tested and the outer MAE is:')
print(error_lgbm_optuna)
[I 2021-01-11 15:45:23,189] A new study created in memory with name: no-name-f4ab1803-9826-4845-af28-c538978df71f [I 2021-01-11 15:45:23,579] Trial 0 finished with value: 331.64915781729684 and parameters: {'max_depth': 21, 'n_estimators': 7, 'learning_rate': 0.3512518893831219, 'num_leaves': 3}. Best is trial 0 with value: 331.64915781729684. [I 2021-01-11 15:45:24,063] Trial 1 finished with value: 311.51880886819833 and parameters: {'max_depth': 17, 'n_estimators': 34, 'learning_rate': 0.2450359284026315, 'num_leaves': 2}. Best is trial 1 with value: 311.51880886819833. [I 2021-01-11 15:45:25,050] Trial 2 finished with value: 287.06103821022685 and parameters: {'max_depth': 12, 'n_estimators': 43, 'learning_rate': 0.3545745918438637, 'num_leaves': 8}. Best is trial 2 with value: 287.06103821022685. [I 2021-01-11 15:45:26,627] Trial 3 finished with value: 278.93885548309805 and parameters: {'max_depth': 40, 'n_estimators': 67, 'learning_rate': 0.2053138060718348, 'num_leaves': 9}. Best is trial 3 with value: 278.93885548309805. [I 2021-01-11 15:45:28,553] Trial 4 finished with value: 280.9757549114948 and parameters: {'max_depth': 15, 'n_estimators': 77, 'learning_rate': 0.05868456846463186, 'num_leaves': 8}. Best is trial 3 with value: 278.93885548309805. [I 2021-01-11 15:45:29,814] Trial 5 finished with value: 281.2306602844631 and parameters: {'max_depth': 34, 'n_estimators': 47, 'learning_rate': 0.25425957315363595, 'num_leaves': 9}. Best is trial 3 with value: 278.93885548309805. [I 2021-01-11 15:45:30,632] Trial 6 finished with value: 320.2547195649087 and parameters: {'max_depth': 10, 'n_estimators': 52, 'learning_rate': 0.05752787467197293, 'num_leaves': 3}. Best is trial 3 with value: 278.93885548309805. [I 2021-01-11 15:45:31,772] Trial 7 finished with value: 292.31205982883415 and parameters: {'max_depth': 17, 'n_estimators': 79, 'learning_rate': 0.2237522415543281, 'num_leaves': 3}. Best is trial 3 with value: 278.93885548309805. [I 2021-01-11 15:45:33,482] Trial 8 finished with value: 279.1381303579481 and parameters: {'max_depth': 37, 'n_estimators': 57, 'learning_rate': 0.1378402932299589, 'num_leaves': 9}. Best is trial 3 with value: 278.93885548309805. [I 2021-01-11 15:45:34,230] Trial 9 finished with value: 299.7717222917997 and parameters: {'max_depth': 32, 'n_estimators': 80, 'learning_rate': 0.2964936905809123, 'num_leaves': 2}. Best is trial 3 with value: 278.93885548309805. [I 2021-01-11 15:45:35,784] Trial 10 finished with value: 283.10497707736914 and parameters: {'max_depth': 3, 'n_estimators': 66, 'learning_rate': 0.142412832660039, 'num_leaves': 6}. Best is trial 3 with value: 278.93885548309805. [I 2021-01-11 15:45:38,289] Trial 11 finished with value: 277.59257732977505 and parameters: {'max_depth': 39, 'n_estimators': 62, 'learning_rate': 0.1386236743841544, 'num_leaves': 10}. Best is trial 11 with value: 277.59257732977505. [I 2021-01-11 15:45:40,561] Trial 12 finished with value: 278.8046346893794 and parameters: {'max_depth': 27, 'n_estimators': 65, 'learning_rate': 0.16307102442438495, 'num_leaves': 10}. Best is trial 11 with value: 277.59257732977505. [I 2021-01-11 15:45:42,085] Trial 13 finished with value: 278.5813599624514 and parameters: {'max_depth': 26, 'n_estimators': 29, 'learning_rate': 0.14171664449475177, 'num_leaves': 10}. Best is trial 11 with value: 277.59257732977505. [I 2021-01-11 15:45:43,295] Trial 14 finished with value: 295.7415559041552 and parameters: {'max_depth': 27, 'n_estimators': 27, 'learning_rate': 0.09806615551001183, 'num_leaves': 6}. Best is trial 11 with value: 277.59257732977505. [I 2021-01-11 15:45:44,411] Trial 15 finished with value: 298.2280781584159 and parameters: {'max_depth': 27, 'n_estimators': 18, 'learning_rate': 0.10586180144178656, 'num_leaves': 10}. Best is trial 11 with value: 277.59257732977505. [I 2021-01-11 15:45:45,329] Trial 16 finished with value: 281.7691456682776 and parameters: {'max_depth': 40, 'n_estimators': 20, 'learning_rate': 0.18131094159970934, 'num_leaves': 7}. Best is trial 11 with value: 277.59257732977505. [I 2021-01-11 15:45:46,924] Trial 17 finished with value: 281.1745604704373 and parameters: {'max_depth': 23, 'n_estimators': 35, 'learning_rate': 0.11147483633646862, 'num_leaves': 10}. Best is trial 11 with value: 277.59257732977505. [I 2021-01-11 15:45:47,741] Trial 18 finished with value: 403.64715989747725 and parameters: {'max_depth': 32, 'n_estimators': 12, 'learning_rate': 0.05353096770491145, 'num_leaves': 8}. Best is trial 11 with value: 277.59257732977505. [I 2021-01-11 15:45:48,678] Trial 19 finished with value: 286.3749148185561 and parameters: {'max_depth': 5, 'n_estimators': 31, 'learning_rate': 0.18776821138350047, 'num_leaves': 5}. Best is trial 11 with value: 277.59257732977505. [I 2021-01-11 15:45:49,945] Trial 20 finished with value: 285.0489007368633 and parameters: {'max_depth': 36, 'n_estimators': 56, 'learning_rate': 0.08559438400965183, 'num_leaves': 5}. Best is trial 11 with value: 277.59257732977505. [I 2021-01-11 15:45:52,318] Trial 21 finished with value: 278.1974511999784 and parameters: {'max_depth': 28, 'n_estimators': 67, 'learning_rate': 0.14994052925461657, 'num_leaves': 10}. Best is trial 11 with value: 277.59257732977505. [I 2021-01-11 15:45:54,322] Trial 22 finished with value: 279.102850107626 and parameters: {'max_depth': 24, 'n_estimators': 71, 'learning_rate': 0.1486504659652291, 'num_leaves': 10}. Best is trial 11 with value: 277.59257732977505. [I 2021-01-11 15:45:55,431] Trial 23 finished with value: 281.7220766927206 and parameters: {'max_depth': 29, 'n_estimators': 40, 'learning_rate': 0.17111505321585826, 'num_leaves': 9}. Best is trial 11 with value: 277.59257732977505. [I 2021-01-11 15:45:57,096] Trial 24 finished with value: 277.5442322875385 and parameters: {'max_depth': 29, 'n_estimators': 59, 'learning_rate': 0.11216807846299492, 'num_leaves': 10}. Best is trial 24 with value: 277.5442322875385. [I 2021-01-11 15:45:58,390] Trial 25 finished with value: 281.6009367553902 and parameters: {'max_depth': 31, 'n_estimators': 61, 'learning_rate': 0.07826947135469714, 'num_leaves': 7}. Best is trial 24 with value: 277.5442322875385. [I 2021-01-11 15:46:00,580] Trial 26 finished with value: 279.588998042182 and parameters: {'max_depth': 37, 'n_estimators': 74, 'learning_rate': 0.12067299442478673, 'num_leaves': 9}. Best is trial 24 with value: 277.5442322875385. [I 2021-01-11 15:46:01,943] Trial 27 finished with value: 279.8573089397263 and parameters: {'max_depth': 35, 'n_estimators': 49, 'learning_rate': 0.2146237871436273, 'num_leaves': 10}. Best is trial 24 with value: 277.5442322875385. [I 2021-01-11 15:46:03,446] Trial 28 finished with value: 280.3496871593675 and parameters: {'max_depth': 40, 'n_estimators': 60, 'learning_rate': 0.1204097096691901, 'num_leaves': 8}. Best is trial 24 with value: 277.5442322875385. [I 2021-01-11 15:46:04,868] Trial 29 finished with value: 288.118798067247 and parameters: {'max_depth': 20, 'n_estimators': 69, 'learning_rate': 0.28788199955468685, 'num_leaves': 7}. Best is trial 24 with value: 277.5442322875385.
The parameters chosen by inner evaluation are:
{'max_depth': 29, 'n_estimators': 59, 'learning_rate': 0.11216807846299492, 'num_leaves': 10}
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
This parameters are taken to train the model.
The model is then tested and the outer MAE is:
309.2147528884083
optuna.visualization.plot_contour(lgbmoptuna)
As a difference of the previous contour plot, here the white portions are much more predominant, so the optimal points are more "flat".
optuna.visualization.plot_parallel_coordinate(lgbmoptuna)
optuna.visualization.plot_param_importances(lgbmoptuna)
max_depth is not so important in this case.
optuna.visualization.plot_slice(lgbmoptuna)
The most interesting graph is the one of the learning rate, which assumes interesting values from 0.1 to 0.2.
Moreover, the num_leaves graph shows a kind of negative "linear" relationship and the most interesting values are close to 10.
optuna.visualization.plot_optimization_history(lgbmoptuna)
1h) Train a CatBoost model with hyper-parameter tuning
#catboost with parameter tuning
#inner evaluation
def objective(trial):
max_depth = trial.suggest_int("max_depth",2,16)
#n_estimators = trial.suggest_int("n_estimators",2,40)
learning_rate = trial.suggest_uniform("learning_rate",0.05,0.4)
#num_leaves = trial.suggest_int("num_leaves",2,40)
clf = catboost.CatBoostRegressor(max_depth,learning_rate=learning_rate)
clf.fit(X_train, y_train)
y_val_pred = clf.predict(X_val)
inner_mae = metrics.mean_absolute_error(y_val, y_val_pred)
return inner_mae
budget=20
np.random.seed(0)
catboostoptuna = optuna.create_study(direction="minimize")
start = time.time()
catboostoptuna.optimize(objective,n_trials=budget)
print('The parameters chosen by inner evaluation are: ')
print(catboostoptuna.best_params)
#outer evaluation
clf = catboost.CatBoostRegressor(**catboostoptuna.best_params)
clf.fit(X_train, y_train)
y_test_pred = clf.predict(X_test)
end = time.time()
totaltime_catboostoptuna = end-start
error_catboost_optuna = metrics.mean_absolute_error(y_test, y_test_pred)
print('This parameters are taken to train the model.')
print('\nThe model is then tested and the outer MAE is:')
print(error_catboost_optuna)
[I 2021-01-11 15:47:07,382] A new study created in memory with name: no-name-87e9c306-b9f0-4ecd-9899-599255492778
0: learn: 583.1725695 total: 501ms remaining: 1s 1: learn: 530.1661575 total: 697ms remaining: 348ms 2: learn: 485.2143015 total: 903ms remaining: 0us
[I 2021-01-11 15:47:11,784] Trial 0 finished with value: 405.694698776873 and parameters: {'max_depth': 3, 'learning_rate': 0.23078244549510724}. Best is trial 0 with value: 405.694698776873.
0: learn: 571.6007630 total: 201ms remaining: 1.81s 1: learn: 514.2318127 total: 352ms remaining: 1.41s 2: learn: 468.3166116 total: 513ms remaining: 1.2s 3: learn: 434.9545055 total: 683ms remaining: 1.02s 4: learn: 410.2037314 total: 853ms remaining: 853ms 5: learn: 393.9861747 total: 1.01s remaining: 671ms 6: learn: 379.5778993 total: 1.19s remaining: 512ms 7: learn: 371.0154396 total: 1.35s remaining: 338ms 8: learn: 362.4972990 total: 1.51s remaining: 168ms 9: learn: 356.7015781 total: 1.69s remaining: 0us
[I 2021-01-11 15:47:14,935] Trial 1 finished with value: 299.7329480078726 and parameters: {'max_depth': 10, 'learning_rate': 0.26669011492249844}. Best is trial 1 with value: 299.7329480078726.
0: learn: 602.8041407 total: 218ms remaining: 2.83s 1: learn: 556.9429505 total: 408ms remaining: 2.45s 2: learn: 517.5599007 total: 644ms remaining: 2.36s 3: learn: 485.3025108 total: 827ms remaining: 2.07s 4: learn: 461.1598517 total: 1.01s remaining: 1.82s 5: learn: 439.9879070 total: 1.21s remaining: 1.62s 6: learn: 423.1953338 total: 1.42s remaining: 1.42s 7: learn: 408.5446030 total: 1.69s remaining: 1.27s 8: learn: 397.5288193 total: 1.94s remaining: 1.08s 9: learn: 388.1925235 total: 2.12s remaining: 848ms 10: learn: 379.4787684 total: 2.31s remaining: 632ms 11: learn: 372.8152336 total: 2.5s remaining: 417ms 12: learn: 367.7591045 total: 2.69s remaining: 207ms 13: learn: 363.3031887 total: 2.87s remaining: 0us
[I 2021-01-11 15:47:19,658] Trial 2 finished with value: 299.9341388458653 and parameters: {'max_depth': 14, 'learning_rate': 0.1717945646740908}. Best is trial 1 with value: 299.7329480078726.
0: learn: 564.6310444 total: 201ms remaining: 1.61s 1: learn: 505.0719572 total: 395ms remaining: 1.38s 2: learn: 460.1688147 total: 647ms remaining: 1.29s 3: learn: 426.9601793 total: 918ms remaining: 1.15s 4: learn: 403.2183905 total: 1.14s remaining: 910ms 5: learn: 387.3220403 total: 1.33s remaining: 667ms 6: learn: 374.8367053 total: 1.56s remaining: 445ms 7: learn: 366.5184724 total: 1.75s remaining: 219ms 8: learn: 359.5107354 total: 1.99s remaining: 0us
[I 2021-01-11 15:47:23,425] Trial 3 finished with value: 299.5826340605421 and parameters: {'max_depth': 9, 'learning_rate': 0.2887888804104544}. Best is trial 3 with value: 299.5826340605421.
0: learn: 539.8109981 total: 181ms remaining: 2.18s 1: learn: 475.2635216 total: 398ms remaining: 2.19s 2: learn: 429.7841853 total: 629ms remaining: 2.1s 3: learn: 399.6870491 total: 918ms remaining: 2.06s 4: learn: 381.0522950 total: 1.13s remaining: 1.81s 5: learn: 370.1985085 total: 1.33s remaining: 1.56s 6: learn: 360.7439810 total: 1.55s remaining: 1.33s 7: learn: 355.6538760 total: 1.72s remaining: 1.08s 8: learn: 349.5127069 total: 1.92s remaining: 853ms 9: learn: 344.6238022 total: 2.1s remaining: 630ms 10: learn: 338.3795832 total: 2.3s remaining: 418ms 11: learn: 334.8425763 total: 2.49s remaining: 207ms 12: learn: 330.8616640 total: 2.67s remaining: 0us
[I 2021-01-11 15:47:27,758] Trial 4 finished with value: 302.5071149974481 and parameters: {'max_depth': 13, 'learning_rate': 0.37096799437245376}. Best is trial 3 with value: 299.5826340605421.
0: learn: 591.1414919 total: 190ms remaining: 1.14s 1: learn: 541.5421383 total: 365ms remaining: 912ms 2: learn: 500.4224957 total: 571ms remaining: 762ms 3: learn: 469.2578219 total: 750ms remaining: 563ms 4: learn: 440.7152362 total: 928ms remaining: 371ms 5: learn: 420.0208347 total: 1.11s remaining: 185ms 6: learn: 404.2326345 total: 1.3s remaining: 0us
[I 2021-01-11 15:47:30,892] Trial 5 finished with value: 333.07147980281275 and parameters: {'max_depth': 7, 'learning_rate': 0.20656621430182082}. Best is trial 3 with value: 299.5826340605421.
0: learn: 570.1169620 total: 197ms remaining: 986ms 1: learn: 512.2537787 total: 385ms remaining: 769ms 2: learn: 466.2592277 total: 577ms remaining: 577ms 3: learn: 432.3590145 total: 818ms remaining: 409ms 4: learn: 409.5113747 total: 1.07s remaining: 213ms 5: learn: 391.0483130 total: 1.25s remaining: 0us
[I 2021-01-11 15:47:33,734] Trial 6 finished with value: 327.7167026416268 and parameters: {'max_depth': 6, 'learning_rate': 0.2713634783364809}. Best is trial 3 with value: 299.5826340605421.
0: learn: 589.0873956 total: 284ms remaining: 1.42s 1: learn: 538.5399636 total: 605ms remaining: 1.21s 2: learn: 497.9665401 total: 892ms remaining: 892ms 3: learn: 464.9992683 total: 1.12s remaining: 560ms 4: learn: 436.9301429 total: 1.38s remaining: 275ms 5: learn: 417.5409295 total: 1.63s remaining: 0us
[I 2021-01-11 15:47:37,021] Trial 7 finished with value: 348.18540600259666 and parameters: {'max_depth': 6, 'learning_rate': 0.21277096197041723}. Best is trial 3 with value: 299.5826340605421.
0: learn: 541.2823577 total: 247ms remaining: 247ms 1: learn: 476.9036553 total: 499ms remaining: 0us
[I 2021-01-11 15:47:39,267] Trial 8 finished with value: 401.80010872974924 and parameters: {'max_depth': 2, 'learning_rate': 0.3659219460441391}. Best is trial 3 with value: 299.5826340605421.
0: learn: 636.5442745 total: 301ms remaining: 2.71s 1: learn: 613.9301186 total: 500ms remaining: 2s 2: learn: 592.6767861 total: 709ms remaining: 1.65s 3: learn: 572.8282538 total: 911ms remaining: 1.37s 4: learn: 553.7131159 total: 1.13s remaining: 1.13s 5: learn: 535.9420079 total: 1.35s remaining: 898ms 6: learn: 519.5541898 total: 1.65s remaining: 706ms 7: learn: 505.4033469 total: 2.02s remaining: 505ms 8: learn: 492.4691646 total: 2.32s remaining: 258ms 9: learn: 480.8547546 total: 2.69s remaining: 0us
[I 2021-01-11 15:47:43,544] Trial 9 finished with value: 404.0441996627152 and parameters: {'max_depth': 10, 'learning_rate': 0.07494171408025685}. Best is trial 3 with value: 299.5826340605421.
0: learn: 555.3052466 total: 226ms remaining: 3.38s 1: learn: 493.3470004 total: 429ms remaining: 3s 2: learn: 446.8259318 total: 631ms remaining: 2.74s 3: learn: 415.7642749 total: 863ms remaining: 2.59s 4: learn: 392.8941768 total: 1.16s remaining: 2.56s 5: learn: 377.8085984 total: 1.46s remaining: 2.44s 6: learn: 366.3218500 total: 1.66s remaining: 2.13s 7: learn: 358.9987390 total: 1.84s remaining: 1.84s 8: learn: 352.1018313 total: 2.04s remaining: 1.58s 9: learn: 347.3846465 total: 2.22s remaining: 1.33s 10: learn: 340.0814936 total: 2.43s remaining: 1.1s 11: learn: 336.9627918 total: 2.66s remaining: 885ms 12: learn: 333.7187832 total: 2.84s remaining: 656ms 13: learn: 332.0902209 total: 3.02s remaining: 432ms 14: learn: 328.7118701 total: 3.24s remaining: 216ms 15: learn: 325.3018229 total: 3.45s remaining: 0us
[I 2021-01-11 15:47:48,873] Trial 10 finished with value: 294.5962393298537 and parameters: {'max_depth': 16, 'learning_rate': 0.3189812312711463}. Best is trial 10 with value: 294.5962393298537.
0: learn: 554.7218078 total: 252ms remaining: 3.77s 1: learn: 492.6340870 total: 454ms remaining: 3.18s 2: learn: 445.3489595 total: 653ms remaining: 2.83s 3: learn: 412.6738594 total: 833ms remaining: 2.5s 4: learn: 391.5096122 total: 1.02s remaining: 2.25s 5: learn: 378.1154752 total: 1.22s remaining: 2.03s 6: learn: 365.4448214 total: 1.42s remaining: 1.82s 7: learn: 358.3549163 total: 1.6s remaining: 1.6s 8: learn: 351.7706538 total: 1.8s remaining: 1.4s 9: learn: 345.8691274 total: 1.98s remaining: 1.19s 10: learn: 341.2586469 total: 2.19s remaining: 998ms 11: learn: 338.5139690 total: 2.4s remaining: 801ms 12: learn: 334.3307737 total: 2.65s remaining: 610ms 13: learn: 332.7409907 total: 2.89s remaining: 414ms 14: learn: 330.0221620 total: 3.09s remaining: 206ms 15: learn: 326.5436284 total: 3.29s remaining: 0us
[I 2021-01-11 15:47:53,850] Trial 11 finished with value: 294.64727939202623 and parameters: {'max_depth': 16, 'learning_rate': 0.32089565561925915}. Best is trial 10 with value: 294.5962393298537.
0: learn: 551.4849266 total: 246ms remaining: 3.68s 1: learn: 488.7236958 total: 455ms remaining: 3.18s 2: learn: 440.3842458 total: 651ms remaining: 2.82s 3: learn: 412.2015800 total: 832ms remaining: 2.5s 4: learn: 392.5590294 total: 1.02s remaining: 2.25s 5: learn: 378.0920954 total: 1.25s remaining: 2.08s 6: learn: 367.0127360 total: 1.53s remaining: 1.97s 7: learn: 360.2597571 total: 1.75s remaining: 1.75s 8: learn: 353.0079589 total: 1.95s remaining: 1.52s 9: learn: 348.6178111 total: 2.16s remaining: 1.29s 10: learn: 344.3725655 total: 2.36s remaining: 1.07s 11: learn: 341.6427092 total: 2.56s remaining: 855ms 12: learn: 339.5357766 total: 2.75s remaining: 635ms 13: learn: 336.2377515 total: 3.03s remaining: 433ms 14: learn: 333.5543395 total: 3.29s remaining: 219ms 15: learn: 329.7878823 total: 3.48s remaining: 0us
[I 2021-01-11 15:47:58,979] Trial 12 finished with value: 299.16991606366526 and parameters: {'max_depth': 16, 'learning_rate': 0.3315746232262441}. Best is trial 10 with value: 294.5962393298537.
0: learn: 534.0088004 total: 211ms remaining: 3.17s 1: learn: 468.9577450 total: 381ms remaining: 2.66s 2: learn: 424.1946708 total: 567ms remaining: 2.46s 3: learn: 395.5634150 total: 769ms remaining: 2.31s 4: learn: 378.1816164 total: 940ms remaining: 2.07s 5: learn: 368.1707917 total: 1.11s remaining: 1.85s 6: learn: 358.8706125 total: 1.29s remaining: 1.66s 7: learn: 354.8673776 total: 1.47s remaining: 1.47s 8: learn: 349.8599995 total: 1.67s remaining: 1.29s 9: learn: 345.5012061 total: 1.85s remaining: 1.11s 10: learn: 340.3805022 total: 2.03s remaining: 921ms 11: learn: 334.1673524 total: 2.22s remaining: 740ms 12: learn: 332.3355541 total: 2.42s remaining: 558ms 13: learn: 329.3447366 total: 2.63s remaining: 376ms 14: learn: 325.6350293 total: 2.83s remaining: 189ms 15: learn: 321.2992287 total: 3.01s remaining: 0us
[I 2021-01-11 15:48:04,199] Trial 13 finished with value: 307.16392145957786 and parameters: {'max_depth': 16, 'learning_rate': 0.3911155337989625}. Best is trial 10 with value: 294.5962393298537.
0: learn: 550.4446381 total: 231ms remaining: 2.77s 1: learn: 487.4831677 total: 416ms remaining: 2.29s 2: learn: 441.2488943 total: 581ms remaining: 1.94s 3: learn: 408.7804308 total: 763ms remaining: 1.72s 4: learn: 388.5519232 total: 946ms remaining: 1.51s 5: learn: 374.5880881 total: 1.13s remaining: 1.32s 6: learn: 363.4195346 total: 1.43s remaining: 1.23s 7: learn: 356.4715855 total: 1.65s remaining: 1.03s 8: learn: 349.8641488 total: 1.83s remaining: 814ms 9: learn: 345.2580395 total: 2.02s remaining: 606ms 10: learn: 341.0657822 total: 2.21s remaining: 403ms 11: learn: 336.4418515 total: 2.49s remaining: 207ms 12: learn: 334.0756355 total: 2.69s remaining: 0us
[I 2021-01-11 15:48:08,573] Trial 14 finished with value: 294.04472224969726 and parameters: {'max_depth': 13, 'learning_rate': 0.33502803572448386}. Best is trial 14 with value: 294.04472224969726.
0: learn: 549.9384475 total: 265ms remaining: 3.18s 1: learn: 486.8824130 total: 527ms remaining: 2.9s 2: learn: 440.6670382 total: 840ms remaining: 2.8s 3: learn: 408.3004947 total: 1.08s remaining: 2.43s 4: learn: 388.1821696 total: 1.26s remaining: 2.02s 5: learn: 374.3180418 total: 1.45s remaining: 1.69s 6: learn: 363.2363642 total: 1.64s remaining: 1.41s 7: learn: 356.3376938 total: 1.86s remaining: 1.16s 8: learn: 349.7574536 total: 2.05s remaining: 912ms 9: learn: 345.0961752 total: 2.25s remaining: 675ms 10: learn: 340.6810091 total: 2.47s remaining: 450ms 11: learn: 337.0101687 total: 2.69s remaining: 224ms 12: learn: 334.3359973 total: 2.9s remaining: 0us
[I 2021-01-11 15:48:13,285] Trial 15 finished with value: 294.2057967804683 and parameters: {'max_depth': 13, 'learning_rate': 0.33671224962129764}. Best is trial 14 with value: 294.04472224969726.
0: learn: 618.9671635 total: 212ms remaining: 2.55s 1: learn: 584.0239000 total: 404ms remaining: 2.22s 2: learn: 551.3522013 total: 589ms remaining: 1.96s 3: learn: 522.9808443 total: 787ms remaining: 1.77s 4: learn: 498.6089126 total: 976ms remaining: 1.56s 5: learn: 477.3347160 total: 1.22s remaining: 1.43s 6: learn: 458.8171473 total: 1.41s remaining: 1.21s 7: learn: 443.9250294 total: 1.61s remaining: 1.01s 8: learn: 430.6486791 total: 1.86s remaining: 828ms 9: learn: 419.6236446 total: 2.13s remaining: 638ms 10: learn: 408.5181106 total: 2.37s remaining: 431ms 11: learn: 399.4750785 total: 2.65s remaining: 221ms 12: learn: 392.6880773 total: 2.89s remaining: 0us
[I 2021-01-11 15:48:17,785] Trial 16 finished with value: 324.78691568763287 and parameters: {'max_depth': 13, 'learning_rate': 0.12476569550073383}. Best is trial 14 with value: 294.04472224969726.
0: learn: 531.6142005 total: 243ms remaining: 2.67s 1: learn: 466.4315888 total: 423ms remaining: 2.12s 2: learn: 422.0234723 total: 611ms remaining: 1.83s 3: learn: 396.0538134 total: 799ms remaining: 1.6s 4: learn: 378.9737102 total: 1.02s remaining: 1.43s 5: learn: 366.6125413 total: 1.22s remaining: 1.22s 6: learn: 358.7622264 total: 1.44s remaining: 1.03s 7: learn: 354.7485166 total: 1.65s remaining: 827ms 8: learn: 348.2516244 total: 1.91s remaining: 636ms 9: learn: 344.0458553 total: 2.15s remaining: 431ms 10: learn: 336.9835145 total: 2.35s remaining: 214ms 11: learn: 331.7875911 total: 2.56s remaining: 0us
[I 2021-01-11 15:48:22,129] Trial 17 finished with value: 306.8892925557088 and parameters: {'max_depth': 12, 'learning_rate': 0.39955247774421987}. Best is trial 14 with value: 294.04472224969726.
0: learn: 546.6816179 total: 282ms remaining: 2.82s 1: learn: 483.0623511 total: 487ms remaining: 2.19s 2: learn: 437.0099686 total: 797ms remaining: 2.12s 3: learn: 407.9766841 total: 1.06s remaining: 1.86s 4: learn: 388.3820521 total: 1.27s remaining: 1.53s 5: learn: 373.5515637 total: 1.48s remaining: 1.23s 6: learn: 363.8684082 total: 1.67s remaining: 953ms 7: learn: 358.0982296 total: 1.85s remaining: 695ms 8: learn: 351.1881495 total: 2.05s remaining: 456ms 9: learn: 346.6799490 total: 2.25s remaining: 225ms 10: learn: 341.9004984 total: 2.48s remaining: 0us
[I 2021-01-11 15:48:26,519] Trial 18 finished with value: 298.5447212155026 and parameters: {'max_depth': 11, 'learning_rate': 0.3476099931300009}. Best is trial 14 with value: 294.04472224969726.
0: learn: 532.9869381 total: 281ms remaining: 3.66s 1: learn: 467.8742433 total: 506ms remaining: 3.04s 2: learn: 423.2584911 total: 712ms remaining: 2.61s 3: learn: 394.8954225 total: 910ms remaining: 2.27s 4: learn: 377.7323728 total: 1.09s remaining: 1.97s 5: learn: 367.0286930 total: 1.27s remaining: 1.7s 6: learn: 358.2991936 total: 1.46s remaining: 1.46s 7: learn: 351.6169257 total: 1.64s remaining: 1.23s 8: learn: 346.5194841 total: 1.84s remaining: 1.02s 9: learn: 342.2519813 total: 2.02s remaining: 807ms 10: learn: 336.6936298 total: 2.21s remaining: 604ms 11: learn: 331.6668259 total: 2.4s remaining: 400ms 12: learn: 327.5939696 total: 2.65s remaining: 204ms 13: learn: 325.7396180 total: 2.85s remaining: 0us
[I 2021-01-11 15:48:30,959] Trial 19 finished with value: 300.0448196576561 and parameters: {'max_depth': 14, 'learning_rate': 0.3947068471859097}. Best is trial 14 with value: 294.04472224969726.
The parameters chosen by inner evaluation are:
{'max_depth': 13, 'learning_rate': 0.33502803572448386}
0: learn: 549.0310916 total: 27.3s remaining: 7h 34m 2s
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) <ipython-input-35-eb38b5f51e78> in <module> 23 #outer evaluation 24 clf = catboost.CatBoostRegressor(**catboostoptuna.best_params) ---> 25 clf.fit(X_train, y_train) 26 y_test_pred = clf.predict(X_test) 27 end = time.time() ~\anaconda3\lib\site-packages\catboost\core.py in fit(self, X, y, cat_features, sample_weight, baseline, use_best_model, eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period, silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model) 4847 self._check_is_regressor_loss(params['loss_function']) 4848 -> 4849 return self._fit(X, y, cat_features, None, None, None, sample_weight, None, None, None, None, baseline, 4850 use_best_model, eval_set, verbose, logging_level, plot, column_description, 4851 verbose_eval, metric_period, silent, early_stopping_rounds, ~\anaconda3\lib\site-packages\catboost\core.py in _fit(self, X, y, cat_features, text_features, embedding_features, pairs, sample_weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, use_best_model, eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period, silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model) 1804 1805 with log_fixup(), plot_wrapper(plot, [_get_train_dir(self.get_params())]): -> 1806 self._train( 1807 train_pool, 1808 train_params["eval_sets"], ~\anaconda3\lib\site-packages\catboost\core.py in _train(self, train_pool, test_pool, params, allow_clear_pool, init_model) 1256 1257 def _train(self, train_pool, test_pool, params, allow_clear_pool, init_model): -> 1258 self._object._train(train_pool, test_pool, params, allow_clear_pool, init_model._object if init_model else None) 1259 self._set_trained_model_attributes() 1260 _catboost.pyx in _catboost._CatBoost._train() _catboost.pyx in _catboost._CatBoost._train() KeyboardInterrupt:
The outer evaluation was extremely slow, the kernel had to be stopped.
#dummy regressor to compare with
np.random.seed(0)
dummy = DummyRegressor()
start = time.time()
dummy = dummy.fit(X_train,y_train)
end = time.time()
totaltime_dummy = end-start
y_test_pred = dummy.predict(X_test)
error_dummy = metrics.mean_absolute_error(y_test,y_test_pred)
print('The error (MAE) of the Dummy Regressor is: ')
print('test set: ', error_dummy)
The error (MAE) of the Dummy Regressor is: test set: 556.5503579633098
error_tot = [error_dummy,error_knn,error_knn_optuna,error_randfor_notun,error_randfor_optuna,error_xgb_notun,error_xgboost_optuna,error_lgbm_optuna]
time_tot = [totaltime_dummy,totaltime_knn,totaltime_knnoptuna,totaltime_randfor_notun,totaltime_randforoptuna,totaltime_xgb_notun,totaltime_xgboostoptuna,totaltime_lgbmoptuna]
plt.figure()
plt.figure(figsize=(10,5))
plt.plot(list(range(len(error_tot))),error_tot, 'b')
plt.xticks(list(range(len(error_tot))), ['Dummy','KNN','KNN optuna','Random Forest','Random Forest Optuna','XGBoost','XGBoost Optuna','Lightboost Optuna'])
plt.xticks(rotation=45, ha="right")
plt.suptitle('Mean absolute error')
plt.ylabel('MAE')
plt.plot(list(range(len(error_tot)))[np.argmin(error_tot)], min(error_tot), 'ro')
plt.figure()
plt.figure(figsize=(10,5))
plt.plot(list(range(len(time_tot))),time_tot, 'b')
plt.xticks(list(range(len(time_tot))), ['Dummy','KNN','KNN optuna','Random Forest','Random Forest Optuna','XGBoost','XGBoost Optuna','Lightboost Optuna','Catboost Optuna'])
plt.xticks(rotation=45, ha="right")
plt.suptitle('Elapsed Time')
plt.ylabel('Time [s]')
Text(0, 0.5, 'Time [s]')
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
The best model among the ones that have been analyzed is the one with Random Forest performed with hyper-parameter tuning.
The Lightboost with hyper-parameter tuning is interesting as well: the result is quite good considering the huge speed of the algorithm.
In general, the K Neighbors methods are the worse ones, and the ensemble methods are in between.
Talking about elapsed time, the Random forests are the ones that take more time, while the XGBoost seems a good compromise in terms of performance and elapsed time.
Please note: in order to answer these questions, you should read the Attribute Selection notebook and understand the main ideas about SelectKBest and Pipeline. Use SelectKBest and Pipeline (and whatever else you need) in order to find a subset of attributes that allows to build an accurate Decision Tree mode. Use the best model of the previous section, but if it is too slow, use Decision Trees (they're faster).
Use the test partition in order to compare different models.
To check the optimal number of attributes, the SelectKBest function is used inside a pipeline which contains also the regressor. The pipeline itself is then a method.
In this case, the Random Forest Regressor is chosen, but to speed-up the processing time, only one parameter is taken (regression__max_depth).
For the same reason, the range for the parameter select__k is only multiples of 25 until 200.
np.random.seed(0)
budget = 2
pipeline = Pipeline([('select',SelectKBest(f_regression)),('regression',RandomForestRegressor())])
param_grid = {"select__k":list(range(25,200,25)),"regression__max_depth":list(range(2,40,5))}
pipeline_model = RandomizedSearchCV(pipeline, param_distributions=param_grid, n_iter=budget)
start = time.time()
fit = pipeline_model.fit(X_train,y_train)
end = time.time()
totaltime_pipeline = end-start
y_test_pred = pipeline_model.predict(X_test)
error_pipeline = metrics.mean_absolute_error(y_test,y_test_pred)
print('The error (MAE) of the Pipeline is: ')
print('test set: ', error_pipeline)
print('\nThe optimal number of attributes is ', list(pipeline_model.best_params_.values())[0])
The error (MAE) of the Pipeline is: test set: 309.00071367413506 The optimal number of attributes is 150
Let's plot now the scores of each attribute, to check are the most important attributes of the dataset.
data = pd.read_pickle('wind_pickle.pickle')
X = data.iloc[:, 6:len(data.columns)]
y=data.iloc[:,0]
best_features = SelectKBest(f_regression)
fit = best_features.fit(X,y)
df_scores = pd.DataFrame(fit.scores_)
df_columns = pd.DataFrame(X.columns)
feature_scores = pd.concat([df_columns, df_scores],axis=1)
feature_scores.columns = ['Feature_Name','Score']
best_features = feature_scores.nlargest(550,'Score')
print(best_features)
plt.figure()
plt.plot(list(range(len(best_features))), best_features["Score"], 'b')
plt.suptitle('Feature Selection')
plt.ylabel('Score')
plt.xlabel('Attribute')
Feature_Name Score 75 p59.162.1 1441.697536 80 p59.162.6 1433.187379 76 p59.162.2 1431.987529 85 p59.162.11 1426.161598 81 p59.162.7 1423.265021 .. ... ... 25 p55.162.1 1.235526 30 p55.162.6 1.229372 35 p55.162.11 1.223969 40 p55.162.16 1.219226 45 p55.162.21 0.989541 [550 rows x 2 columns]
Text(0.5, 0, 'Attribute')
It can be seen that the most important attributes are not the ones of Sotavento.
The most important Sotavento attribute, indeed, is not even in the first 10 attributes!
This shows that it's important to consider other spots rather than only the ones of real interest.
The graph shows as well that there's a big drop in the score, after the 25th attribute.